Skip to content

fix: Set Substrait output types for expressions#20597

Merged
gabotechs merged 7 commits into
apache:mainfrom
wlhjason:GH-15831-substrait-output-types
May 27, 2026
Merged

fix: Set Substrait output types for expressions#20597
gabotechs merged 7 commits into
apache:mainfrom
wlhjason:GH-15831-substrait-output-types

Conversation

@wlhjason

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

The Substrait producer did not set the ScalarFunction output_type when converting binary expressions, which broke consumers relying on the output_type.

What changes are included in this PR?

  • Refactor from_join and from_between to eliminate direct calls to make_binary_op_scalar_func
  • Set the Substrait ScalarFunction output_type when converting several types of DataFusion expressions:
    • Binary expressions (Expr::BinaryExpr)
    • Unary expressions (like Expr::Not)
    • Scalar functions (Expr::ScalarFunction)

There are a few more places where the output_type has not been set, such as from_like and from_in_list, as mentioned in #15831. I've left these out of scope here as fixing them would require more substantial code changes.

Are these changes tested?

Yes, via a new unit test.

Are there any user-facing changes?

No, beyond the Substrait output fix.

@github-actions github-actions Bot added the substrait Changes to the substrait crate label Feb 27, 2026
@wlhjason

Copy link
Copy Markdown
Contributor Author

@gabotechs Could you possibly take a look at this or suggest another reviewer? Thanks!

@github-actions

Copy link
Copy Markdown

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale PR has not had any activity for some time label May 10, 2026
@wlhjason wlhjason force-pushed the GH-15831-substrait-output-types branch 2 times, most recently from 0fdbc92 to 6d0fbdc Compare May 16, 2026 18:04
@wlhjason

Copy link
Copy Markdown
Contributor Author

@alamb Any chance you could help me find a reviewer for this one? (I'm not sure how I should go about this as a first-time DataFusion contributor). Many thanks!

@github-actions github-actions Bot removed the Stale PR has not had any activity for some time label May 17, 2026
@alamb

alamb commented May 18, 2026

Copy link
Copy Markdown
Contributor

I @wlhjason -- thanks for the ping.

Maybe @westonpace or @gabotechs have time to review this one (or know someone who does)

@gabotechs

Copy link
Copy Markdown
Contributor

Requesting backup now

@westonpace westonpace left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me. The output type is not optional according to Substrait so the previous behavior was not compliant with the spec. This change brings things into compliance.

Comment thread datafusion/substrait/src/logical_plan/producer/expr/scalar_function.rs Outdated
Comment thread datafusion/substrait/src/logical_plan/producer/rel/join.rs
Comment thread datafusion/substrait/src/logical_plan/producer/rel/join.rs
@westonpace

Copy link
Copy Markdown
Member

Here's some feedback from Claude:

  P1 — from_join behavioral change is observable on the wire

  The old code combined on and filter as AND(reduce_left(on_conditions), filter). The new code uses conjunction(on_conditions.chain(filter)), which folds right in DataFusion's conjunction
  helper (it uses Iterator::reduce and Expr::and, but the associativity of the resulting tree may differ). Semantically equivalent, but the emitted Substrait byte representation will change
  for joins with 2+ ON predicates plus a filter. Worth:
  - Calling this out in the PR description.
  - Adding a join-with-filter test (or sqllogictest) that exercises both on and filter simultaneously and asserts the produced plan, so future refactors don't silently change wire format
  again.

  P2 — Test coverage is thin

  Only one new test, covering the simplest case (int64 + int64). Given the scope (binary, unary, scalar, higher-order, join refactor, between refactor), consider adding:
  - Unary case (Expr::Not(...) → Boolean, nullable matches input).
  - A nullable-vs-non-nullable case to confirm is_nullable() propagates into the Substrait type's nullability.
  - A from_between test that round-trips both BETWEEN and NOT BETWEEN — the refactor here is non-trivial and lacks direct coverage.
  - A from_join test asserting the join expression structure after the refactor.

  P3 — Expr::to_field(schema)? cost

  Expr::to_field recursively re-derives types from the schema. With this change, every from_binary_expr call now does a to_field over the whole subexpression and recurses into each child,
  which itself re-runs to_field on its subtree. That makes typing O(depth × nodes) over the expression tree. For typical SQL this is unnoticeable, but for generated/synthetic deep
  expressions it can be a quadratic blow-up. Not a blocker, but worth knowing — caching (data_type, nullable) on the way down (or accepting an already-computed type from the caller) would
  fix it.

  P4 — Minor cleanups

  - Expr::ScalarFunction(fun.clone()).to_field(schema)? and Expr::BinaryExpr(expr.clone()).to_field(schema)? clone the expression solely to call a trait method on &Expr. If ExprSchemable
  exposed equivalents on the inner types (ScalarFunction, BinaryExpr), the clones could be avoided. Not introduced by this PR, but the new code makes them more prevalent.
  - In from_between, the *expr.clone() pattern occurs four times in the BETWEEN-not-negated branch:
  Expr::and(
      Expr::lt_eq(*low.clone(), *expr.clone()),
      Expr::lt_eq(*expr.clone(), *high.clone()),
  )
  - A let expr_value = (*expr).clone(); once at the top would tighten this.
  - The new test is #[tokio::test] but the body is fully synchronous — #[test] is sufficient.
  - to_substrait_type(producer, output_field.data_type(), output_field.is_nullable()) is called 3 times. The existing to_substrait_type_from_field would be a slightly cleaner call (after
  wrapping Field → FieldRef). Pure nit.

I don't think I'm two worried about P1. The plans are still roughly equivalent. Might be worth just calling out in the PR description.

I agree with P2, in particular, we should have a test for the join case.

I think P3 is inevitable and somewhat accepted at the moment.

Agree that the things in P4 are minor, take them or leave them.

@wlhjason wlhjason force-pushed the GH-15831-substrait-output-types branch from 876cf5f to d560e09 Compare May 23, 2026 21:41
@wlhjason

Copy link
Copy Markdown
Contributor Author

Many thanks for the review, @westonpace! I think I've addressed all the key points now.

@westonpace

Copy link
Copy Markdown
Member

I think I've addressed all the key points now.

Agreed. I don't have merge privilege but maybe @alamb can pull the trigger?

@wlhjason wlhjason force-pushed the GH-15831-substrait-output-types branch from d560e09 to 53f2bb3 Compare May 26, 2026 22:40
@wlhjason

Copy link
Copy Markdown
Contributor Author

Rebased on main and resolved conflicts from 9a6f67e

@gabotechs gabotechs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wlhjason for the PR and @westonpace for the review! this LGTM, I'll merge it soon.

@gabotechs gabotechs added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@gabotechs gabotechs added this pull request to the merge queue May 27, 2026
Merged via the queue into apache:main with commit 77240f9 May 27, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure Substrait producer for BinaryExpr includes output_type

4 participants